Abstract

This paper investigates the problem of active learning for binary label prediction on a
graph. We introduce a simple and label-efficient algorithm called S^2 for this task. At each step,
S^2 selects the vertex to be labeled based on the structure of the graph and all previously gathered
labels. Specifically, S^2 queries for the label of the vertex that bisects the shortest shortest
path between any pair of oppositely labeled vertices. We present a theoretical estimate of the
number of queries S^2 needs in terms of a novel parametrization of the complexity of binary
functions on graphs. We also present experimental results demonstrating the performance of
S^2 on both real and synthetic data. While other graph-based active learning algorithms have
shown promise in practice, our algorithm is the first with both good performance and theoretical
guarantees. Finally, we demonstrate the implications of the S^2 algorithm to the theory of
nonparametric active learning. In particular, we show that S^2 achieves near minimax optimal
excess risk for an important class of nonparametric classification problems.

Keywords: active learning on graphs, query complexity of finding a cut, nonparametric classification


1 Introduction 

This paper studies the problem of binary label prediction on a graph. We suppose that we are given 
a graph over a set of vertices, where each vertex is associated with an initially unknown binary 
label. For instance, the vertices could represent objects from two classes, and the graph could 
represent the structure of similarity of these objects. The unknown labels then indicate the class 
that each object belongs to. The goal of the general problem of binary label prediction on a graph 
is to predict the label of all the vertices given (possibly noisy) labels for a subset of the vertices. 
Obtaining this initial set of labels could be costly as it may involve consulting human experts or 
expensive experiments. It is therefore of considerable interest to minimize the number of vertices
whose labels need to be revealed before the algorithm can predict the remaining labels accurately.
In this paper, we are especially interested in designing an active algorithm for addressing this 
problem, that is, an algorithm that sequentially and automatically selects the vertices to be labeled 
based on both the structure of the graph and the previously gathered labels. We will now highlight 
the main contributions of the paper: 

• A new active learning algorithm which we call S^2. In essence, S^2, which stands for shortest
shortest path, operates as follows. Given a graph and a subset of vertices that have been
labeled, S^2 picks the mid-point of the path of least length among all the shortest paths connecting
oppositely labeled vertices. As we will demonstrate in Section 4, S^2 automatically adapts
itself to a natural notion of the complexity of the cut-set of the labeled graph.

• A novel complexity measure. While prior work on graph label prediction has focused 
only on the cut-size (i.e., the number of edges that connect oppositely labeled nodes) of 
the labeling, our refined complexity measure (cf. Section 4.1) quantifies the difficulty of 
learning the cut-set. Roughly speaking, it measures how clustered the cut-set is; the more 
clustered the cut-set, the easier it is to find. This is analogous to the fact that the difficulty of 
classification problems in standard settings depends both on the size and the complexity of 
the Bayes decision boundary. 

• A practical algorithm for non-parametric active learning. Significant progress has been 
made in terms of characterizing the theoretical advantages of active learning in nonparametric
settings (e.g., Castro and Nowak [2008], Hanneke [2011], Koltchinskii [2010], Minsker
[2012], Wang [2011]), but most methods are not easy to apply in practice. On the other 
hand, the algorithm proposed in Zhu et al. [2003b], for example, offers a flexible approach 
to nonparametric active learning that appears to provide good results in practice. It does not,
however, come with theoretical performance guarantees. A contribution of our paper is to fill
the gap between the practical method of Zhu et al. [2003b] and the theoretical work above.
We show that S^2 achieves the minimax rate of convergence for classification problems with
decision boundaries in the so-called box-counting class (see Castro and Nowak [2008]). To 
the best of our knowledge this is the first practical algorithm that is minimax-optimal (up to 
logarithmic factors) for this class of problems. 

1.1 Related Work 

Label prediction on the vertices of a graph is an important and challenging problem. Many practical
algorithms have been proposed for this (e.g., Blum and Chawla [2001], Blum et al. [2004],
Dasgupta and Hsu [2008], Zhu et al. [2003a]), with numerous applications such as information
retrieval [Joachims, 2003] and learning intracellular pathways in biological networks [Chasman
et al., 2009], among others. Theoretically, however, a complete understanding of this problem has
remained elusive. In the active learning version of the problem there even appear to be contradictory
results, with some supporting the benefit of active learning [Afshani et al., 2007, Dasgupta
and Hsu, 2008] while others are pessimistic [Cesa-Bianchi et al., 2010]. In this paper, we propose
a new and simple active learning algorithm, S^2, and examine its performance. Our theoretical
analysis of S^2, which utilizes a novel parametrization of the complexity of labeling functions with
respect to graphs, clearly shows the benefit of adaptively choosing queries on a broad class of problems.
The authors in Cesa-Bianchi et al. [2010] remark that “adaptive vertex selection based on
labels” is not helpful in worst-case adversarial settings. Our results do not contradict this since we
are concerned with more realistic and non-adversarial labeling models (which may be deterministic
or random).

Adaptive learning algorithms for the label prediction on graphs follow one of two approaches. 
The algorithm can either (i) choose all its queries upfront according to a fixed design based on the 
structure of the graph without observing any labels (e.g., in Cesa-Bianchi et al. [2010], Gu and 
Han [2012]), or (ii) pick vertices to query in a sequential manner based on the structure of the
graph and previously collected labels (e.g., Afshani et al. [2007], Zhu et al. [2003b]).^1 The former
is closely related to experimental design in statistics, while the latter is adaptive and more akin to 
active learning; this is the focus of our paper. 

Another important component to this problem is using the graph structure and the labels at 
a subset of the vertices to predict the (unknown) labels at all other vertices. This arises in both 
passive and active learning algorithms. This is a well-studied problem in semi-supervised learning 
and there exist many good methods [Blum and Chawla, 2001, Gu and Han, 2012, Zhou et al., 2004, 
Zhu et al., 2003a]. The focus of our paper is the adaptive selection of vertices to label. Given the 
subset of labeled vertices it generates, any of the existing methods mentioned above can be used to 
predict the rest. 

The main theoretical results of our paper characterize the sample complexity of learning the 
cut-set of edges that partition the graph into disjoint components corresponding to the correct 
underlying labeling of the graph. The work most closely related to the focus of this paper is Afshani 
et al. [2007], which studies this problem from the perspective of query complexity [Angluin, 2004]. 
Our results improve upon those in Afshani et al. [2007]. The algorithm we propose is able to take 
advantage of the fact that the cut-set edges are often close together in the graph, and this can greatly 
reduce the sample complexity in theory and practice. 

Finally, our theoretical analysis of the performance of S^2 quantifies the number of labels required
to learn the cut-set and hence the correct labeling of the entire graph. It does not quantify
the number of labels needed to achieve a desired nonzero (Hamming) error level. To the
best of our knowledge, there are no algorithms for which there is a sound characterization of the
Hamming error. Using results from Guillory and Bilmes [2009], Cesa-Bianchi et al. [2010] take a
step in this direction but their results are valid only for trees. For more general graphs, such Hamming
error guarantees are unknown. Nonetheless, learning the entire cut-set induced by a labeling
guarantees a Hamming error of zero, and so the two goals are intimately related.


2 Preliminaries and Problem Setup 

We will write $G = (V, E)$ to denote an undirected graph with vertex set $V$ and edge set $E$. Let
$f : V \to \{-1, +1\}$ be a binary function on the vertex set. We call $f(v)$ the label of $v \in V$. We
will further suppose that we only have access to the labeling function $f$ through a (potentially)
noisy oracle. That is, fixing $\gamma \in (0, 0.5]$, we will assume that for each vertex $v \in V$, the oracle
returns a random variable $\hat{f}(v)$ such that $\hat{f}(v)$ equals $-f(v)$ with probability $\gamma$ and equals $f(v)$
with probability $1 - \gamma$, independently of any other query submitted to this oracle. We will refer
to such an oracle as a $\gamma$-noisy oracle. The goal then is to design an algorithm which sequentially
selects a multiset^2 of vertices $L$ and uses the labels $\{\hat{f}(v), v \in L\}$ to accurately learn the true
labeling function $f$. Since the algorithm we design will assume nothing about the labeling, it will
be equipped to handle even an adversarial labeling of the graph. Towards this end, if we let $C$
denote the cut-set of the labeled graph, i.e., $C = \{\{x, y\} \in E : f(x) \ne f(y)\}$, and let $\partial C$ denote
the boundary of the cut-set, i.e., $\partial C = \{x \in V : \exists\, e \in C \text{ with } x \in e\}$, then our goal will actually
be to identify $\partial C$.

^1 This distinction is important since the term “active learning” has been used in Cesa-Bianchi et al. [2010] and Gu
and Han [2012] in the sense of (i).

Of course, it is of interest to make $L$ as small as possible. Given $\epsilon \in (0, 1)$, the number of
vertices that an algorithm requires the oracle to label, so that with probability at least $1 - \epsilon$ the
algorithm correctly learns $f$, will be referred to as its $\epsilon$-query complexity. We will now show that
given an algorithm that performs well with a noiseless oracle, one can design an algorithm that
performs well on a $\gamma$-noisy oracle.


Proposition 1. Suppose $A$ is an algorithm that has access to $f$ through a noiseless oracle, and
suppose that it has an $\epsilon$-query complexity $q$. Then for each $\gamma \in (0, 0.5)$, there exists an algorithm $\tilde{A}$
which, using a $\gamma$-noisy oracle, achieves a $2\epsilon$-query complexity given by
$q \times \frac{1}{2(0.5 - \gamma)^2} \log\left(\frac{n}{\epsilon}\right)$.


The main idea behind this proposition is that one can build a noise-tolerant version of $A$ by
repeating each query that $A$ requires many times, and using the majority vote. The proof of Proposition 1
is then a straightforward application of Chernoff bounds and we defer it to Appendix A.
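The majority-vote idea can be illustrated with a small sketch. The function names, the dictionary representation of the true labels, and the choice to round the repetition count up to an integer are assumptions made for this example only; ties (possible when the number of repetitions is even) are broken in favour of +1 arbitrarily.

```python
import math
import random


def repetitions_needed(n, eps, gamma):
    """Repeated queries per vertex, log(n/eps) / (2 (0.5 - gamma)^2), rounded up."""
    return math.ceil(math.log(n / eps) / (2 * (0.5 - gamma) ** 2))


def make_noisy_oracle(f, gamma, rng):
    """Hypothetical gamma-noisy oracle: returns -f[v] with probability gamma."""
    def oracle(v):
        return -f[v] if rng.random() < gamma else f[v]
    return oracle


def majority_label(noisy_oracle, v, reps):
    """Query vertex v `reps` times and return the majority label (+1 or -1)."""
    total = sum(noisy_oracle(v) for _ in range(reps))
    return 1 if total >= 0 else -1


if __name__ == "__main__":
    truth = {"a": 1, "b": -1}
    oracle = make_noisy_oracle(truth, gamma=0.25, rng=random.Random(0))
    reps = repetitions_needed(n=100, eps=0.05, gamma=0.25)
    print(reps, {v: majority_label(oracle, v, reps) for v in truth})
```

Any noiseless-oracle algorithm can then call `majority_label` wherever it would have called the noiseless oracle, at the cost of multiplying its query count by `reps`.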

Therefore, to keep our presentation simple, we will assume in the sequel that our oracle is 
noiseless. A more nuanced treatment of noisy oracles is an interesting avenue for future work. 

It should also be noted here that the results in this paper can be extended to the multi-class
setting, where $f : V \to \{1, 2, \ldots, k\}$, by the standard “one-vs-rest” heuristic (see, e.g., Bishop
et al. [2006]). However, a thorough investigation of multiclass classification on graphs is an
interesting avenue for future work.


3 The S^2 algorithm

The name S^2 signifies the fact that the algorithm bisects the shortest shortest-path connecting oppositely
labeled vertices in the graph. As we will see, this allows the algorithm to automatically
take advantage of the clusteredness of the cut set. Therefore, S^2 “unzips” through a tightly clustered
cut set and locates it rapidly (see Figure 1).

^2 A multiset is a set that may contain repeated elements.

Figure 1: A sample run of the S^2 algorithm on a grid graph. The shaded and unshaded vertices represent
two different classes (say $+1$ and $-1$ respectively). See text for explanation.
(a) Random sampling ends. One shortest shortest path is shown with thickened edges.
(b) This path is bisected and S^2 finds one cut edge.
(c) The next shortest shortest path is bisected.
(d) This continues till S^2 unzips through the cut boundary.

Algorithm 1 S^2: Shortest Shortest Path

Input: Graph $G = (V, E)$, BUDGET $\le n$
1: $L \leftarrow \emptyset$
2: while 1 do
3:     $x \leftarrow$ Randomly chosen unlabeled vertex
4:     do
5:         Add $(x, f(x))$ to $L$
6:         Remove from $G$ all edges whose two ends have different labels.
7:         if $|L| =$ BUDGET then
8:             Return LABELCOMPLETION($G$, $L$)
9:         end if
10:     while $x \leftarrow$ MSSP($G$, $L$) exists
11: end while

The algorithm works by first constructing a label set $L \subseteq V$ sequentially and adaptively and
then using the labels in $L$ to predict the labels of the vertices in $V \setminus L$. It accepts as input a natural
number BUDGET, which is the query budget that we are given to complete the learning task. In


our theoretical analysis, we will show how big this budget needs to be in order to perform well 
on a wide variety of problems. Of course this budget specification merely reflects the completely 
agnostic nature of the algorithm; we subsequently show (see Section 3.1) how one might factor in 
expert knowledge about the problem to create more useful stopping criteria. 

S^2 (Algorithm 1) accepts a graph $G$ and a natural number BUDGET. Step 3 performs a random
query. In step 6, the obvious cut edges are identified and removed from the graph. This ensures that
once a cut edge is found, we do not expend more queries on it. In step 7, the algorithm ensures that
it has enough budget left to proceed. If not, it stops querying for labels and completes the labeling
using the subroutine LABELCOMPLETION. In step 10, the mid-point of the shortest shortest path is
found using the MSSP subroutine (Subroutine 2) and the algorithm proceeds till there are no more
mid-points to find. Then, the algorithm performs another random query and the above procedure is
repeated. As discussed in Proposition 1, if in step 5 we compute $\hat{f}(x)$ as the majority label among
$O(\log(n/\epsilon))$ repeated queries about the label of $x$, then we get a noise-tolerant version of S^2.

Sub-routine 2 MSSP

Input: Graph $G = (V, E)$, $L \subseteq V$
1: for each $v_i, v_j \in L$ such that $f(v_i) \ne f(v_j)$ do
2:     $P_{ij} \leftarrow$ shortest path between $v_i$ and $v_j$ in $G$
3:     $\ell_{ij} \leftarrow$ length of $P_{ij}$ ($\infty$ if no path exists)
4: end for
5: $(i^*, j^*) \leftarrow \arg\min_{v_i, v_j \in L : f(v_i) \ne f(v_j)} \ell_{ij}$
6: if $(i^*, j^*)$ exists then
7:     Return mid-point of $P_{i^* j^*}$ (break ties arbitrarily).
8: else
9:     Return $\emptyset$
10: end if

We will now describe the sub-routines used by S^2. Each of these accepts a graph $G$ and a
set $L$ of vertices. LABELCOMPLETION($G$, $L$) is any off-the-shelf procedure that can complete


the labeling of a graph from a subset of labels. This could, for instance, be the graph min-cut
procedure [Blum and Chawla, 2001] or harmonic label propagation [Zhu et al., 2003a]. Since
we endeavor to learn the entire cut boundary, we only include this sub-routine for the sake of
completeness so that the algorithm can run with any given budget. Our theoretical results do not
depend on this subroutine. MSSP($G$, $L$) is Subroutine 2 and finds the midpoint on the shortest
among all the shortest paths that connect oppositely labeled vertices in $L$. If none exist, it returns
$\emptyset$.
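The query loop and the midpoint subroutine described above can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the graph is an adjacency dictionary of sets, the noiseless oracle is any callable returning +1 or -1, shortest paths are found by plain breadth-first search over every oppositely labeled pair (simple rather than efficient), and the label-completion step is left out, so the function returns the queried labels and the pruned graph instead.

```python
import random
from collections import deque


def shortest_path(adj, src, dst):
    """BFS shortest path from src to dst as a list of vertices, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    return None


def mssp(adj, labels):
    """Midpoint of the shortest among all shortest paths that join
    oppositely labeled vertices, or None if no such path exists."""
    pos = [v for v, y in labels.items() if y == 1]
    neg = [v for v, y in labels.items() if y == -1]
    best = None
    for u in pos:
        for w in neg:
            p = shortest_path(adj, u, w)
            if p is not None and (best is None or len(p) < len(best)):
                best = p
    return None if best is None else best[len(best) // 2]


def s2(adj, oracle, budget, rng):
    """Sketch of the S^2 query loop; returns (labels, graph without found cut edges)."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    labels = {}
    while len(labels) < budget:
        unlabeled = [v for v in adj if v not in labels]
        if not unlabeled:
            break
        x = rng.choice(unlabeled)  # random sampling phase
        while x is not None and len(labels) < budget:
            labels[x] = oracle(x)
            # Only edges at x can newly join two different labels.
            for w in list(adj[x]):
                if w in labels and labels[w] != labels[x]:
                    adj[x].discard(w)
                    adj[w].discard(x)
            x = mssp(adj, labels)  # aggressive search phase
    return labels, adj


if __name__ == "__main__":
    path = {i: set() for i in range(10)}
    for i in range(9):
        path[i].add(i + 1)
        path[i + 1].add(i)
    truth = {i: 1 if i < 6 else -1 for i in range(10)}
    labels, pruned = s2(path, truth.__getitem__, budget=6, rng=random.Random(1))
    print(sorted(labels), 6 in pruned[5])
```

Because cut edges are removed as soon as both end points are labeled, every path returned by `mssp` has at least one interior vertex, and that midpoint is always unlabeled.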

The main idea underlying the S^2 algorithm is the fact that learning the labeling function $f$
amounts to locating all the cut-edges in the labeled graph. Conceptually, the algorithm can be
thought of as operating in two phases: random sampling and aggressive search. In the random
sampling phase, the algorithm queries randomly chosen unlabeled vertices until it finds two vertices
with opposite labels. Then our algorithm enters the aggressive search phase. It picks the
shortest path between these two points and bisects it. What our algorithm does next sets it apart
from prior work such as Afshani et al. [2007]. It does not run each binary search to the end, but
merely keeps bisecting the shortest among all the shortest paths that connect oppositely labeled
vertices observed so far. This endows the algorithm with the ability to “unzip” cut boundaries.
Consider a sample run shown in Figure 1. The random sampling phase first picks a set of random
vertices till an oppositely labeled pair is observed as in Figure 1(a). The shortest shortest path connecting
oppositely labeled nodes is shown here as a thick sequence of edges. Figure 1(b) shows
that S^2 now bisects shortest shortest paths and subsequently finds a cut-edge. The bisection of the
next two shortest shortest paths is shown in Figure 1(c) and we see the boundary unzipping feature
of the algorithm emerge. Figure 1(d) finally shows the situation after the completion of the algorithm.
Notice that an extremely small number of queries are used before S^2 completely discovers
the cut boundary.

3.1 Stopping Criterion 

Notice that S^2 stops only if the budget is exhausted. This stopping criterion was chosen for two main
reasons. Firstly, this keeps the presentation simple and reflects the fact that S^2 assumes nothing
about the underlying problem. In practice, extra information about the problem can be easily incorporated.
For instance, if one knows a bound on the cut-set or on the size of similarly
labeled connected components, then such criteria can be used in Step 7. Similarly, one can hold
out a random subset of observed labels and stop upon achieving low prediction error on this subset.
Secondly, in our theoretical analysis, we show that as long as the budget is large enough,
S^2 recovers the cut boundary exactly. The larger the budget is, the more complex the graphs and
labelings S^2 can handle. Therefore, our result can be interpreted as a quantification of the complexity
of a natural class of graphs. In that sense S^2 is not only an algorithm but a tool for exposing the
theoretical challenges of label prediction on a graph.


4 Analysis 

Let $C$ and $\partial C$ be as defined in Section 2. As we observed earlier, the problem of learning the
labeling function $f$ is equivalent to the problem of locating all the cut edges in the graph. Clearly,
if $f$ could be arbitrary, the task of learning $f$ given its value on a subset of its domain is ill-posed.
However, we can make this problem interesting by constraining the class of functions that $f$ is
allowed to be in. Towards this end, in the next section we will discuss the complexity of a labeling
function with respect to the underlying graph.

4.1 Complexity of the Labeling Function with respect to the Graph 

We will begin by making a simple observation. Given a graph $G$ and a labeling function $f$, notice
that $f$ partitions the vertices of the graph into a collection of connected components with identically
labeled vertices. We will denote these connected components as $V_1, V_2, \ldots, V_k$, where $k$ represents
the number of connected components.

Then, the above partitioning of the vertex set induces a natural partitioning of the cut set $C =
\bigcup_{1 \le r < s \le k} C_{rs}$, where $C_{rs} = \{\{x, y\} \in C : x \in V_r, y \in V_s\}$. That is, $C_{rs}$ contains the cut edges
whose boundary points are in the components $V_r$ and $V_s$.^3 We denote the number of non-empty
subsets $C_{rs}$ by $m$, and we call each non-empty $C_{rs}$ a cut component. See Figure 2(a) for an
illustration. It shows a graph, its corresponding labeling (denoted by darker and lighter colored
vertices) and the cut set (thickened lines). It also shows the induced vertex components $V_1, V_2, V_3$
and the corresponding cut components $C_{12}$ and $C_{23}$. We will especially be interested in how
clustered a particular cut component is. For instance, compare the cut component $C_{12}$ between
Figures 2(a) and 2(b). Intuitively, it is clear that the former is more clustered than the latter. As
one might expect, we show that it is easier to locate well-clustered cut components.
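To make these definitions concrete, the following sketch computes the cut set, its boundary, the identically labeled connected components and the cut components for a small labeled graph. The adjacency-dictionary representation and all function names are assumptions chosen for illustration; component indices simply follow the iteration order of the dictionary.

```python
from collections import deque


def label_components(adj, f):
    """Assign each vertex the index of its connected component of
    identically labeled vertices; returns (component map, k)."""
    comp = {}
    k = 0
    for s in adj:
        if s in comp:
            continue
        comp[s] = k
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in comp and f[w] == f[u]:
                    comp[w] = k
                    queue.append(w)
        k += 1
    return comp, k


def cut_structure(adj, f):
    """Return (cut set C, boundary of C, cut components keyed by (r, s), k)."""
    comp, k = label_components(adj, f)
    cut = {frozenset((u, w)) for u in adj for w in adj[u] if f[u] != f[w]}
    boundary = {v for e in cut for v in e}
    cut_components = {}
    for e in cut:
        u, w = tuple(e)
        key = tuple(sorted((comp[u], comp[w])))
        cut_components.setdefault(key, set()).add(e)
    return cut, boundary, cut_components, k


if __name__ == "__main__":
    # Path 0-1-2-3-4-5 labeled +, +, -, -, +, +
    adj = {i: set() for i in range(6)}
    for i in range(5):
        adj[i].add(i + 1)
        adj[i + 1].add(i)
    f = {0: 1, 1: 1, 2: -1, 3: -1, 4: 1, 5: 1}
    cut, boundary, comps, k = cut_structure(adj, f)
    print(len(cut), sorted(boundary), sorted(comps), k)
```

In this example there are k = 3 vertex components and m = 2 cut components, each consisting of a single cut edge.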

We will now introduce three parameters which, we argue, naturally govern the complexity of a 
particular problem instance. Our main results will be expressed in terms of these parameters. 

1. Boundary Size. The first and the most natural measure of complexity we consider is the
size of the boundary of the cut set $|\partial C|$. It is not hard to see that a problem instance is hard if
the boundary size induced by the labeling is large. In fact, $|\partial C|$ is trivially a lower bound on
the total number of queries needed to learn the location of $C$. Conversely, if it is the case that
well-connected vertices predominantly have the same label, then $|\partial C|$ will most likely be small.
Theorem 1 will show that the number of queries needed by the S^2 algorithm scales approximately
linearly with $|\partial C|$.

^3 $C_{rs}$ could of course be empty for certain pairs $r, s$.

2. Balancedness. The next notion of label complexity we consider is the balancedness of
the labeling. As we discussed above, the labeling function induces a partition on the vertex set
$V_1, V_2, \ldots, V_k$. We will define the balancedness of $f$ with respect to $G$ as $\beta = \min_{1 \le i \le k} |V_i| / n$.

It is not hard to see why balancedness is a natural way of quantifying problem complexity.
If there is a very small component, then without prior knowledge, it will be unlikely that we see
labeled vertices from this component. We would then have no reason to assign labels to its vertices
that are different from its neighbors. Therefore, as we might expect, the larger the value of $\beta$, the
easier it is to find a particular labeling (see Lemma 1 for more on this).

3. Clustering of Cut Components. We finally introduce a notion of complexity of the labeling
function which, to the best of our knowledge, is novel. As we show, this complexity parameter is
key to developing an efficient active learning algorithm for general graphs.

Let $d^G_{sp} : V \times V \to \mathbb{N} \cup \{0, \infty\}$ denote the shortest path metric with respect to the graph $G$,
i.e., $d^G_{sp}(x, y)$ is the length of the shortest path connecting $x$ and $y$ in $G$ with the convention that
the distance is $\infty$ if $x$ and $y$ are disconnected. Using this, we will define a metric $\delta : C \times C \to
\mathbb{N} \cup \{0, \infty\}$ on the cut-edges as follows. Let $e_1 = \{x_1, y_1\}, e_2 = \{x_2, y_2\} \in C$ be such that
$f(x_1) = f(x_2) = +1$ and $f(y_1) = f(y_2) = -1$. Then $\delta(e_1, e_2) = d^{G-C}_{sp}(x_1, x_2) + d^{G-C}_{sp}(y_1, y_2) + 1$,
where $G - C$ is the graph $G$ with all the cut-edges removed. Notice that $\delta$ is a metric on $C$ and
that $\delta(e_1, e_2) < \infty$ if and only if $e_1$ and $e_2$ lie in the same cut-component.

Imagine a scenario where a cut-component $C_{rs}$ is such that each pair $e, e' \in C_{rs}$ satisfies
$\delta(e, e') \le \kappa$. Now, suppose that the end points of one of the cut-edges $e_1 \in C_{rs}$ have been discovered.
By assumption, each remaining cut-edge in $C_{rs}$ lies in a path of length at most $\kappa$ from
a pair of oppositely labeled vertices (i.e., the end points of $e_1$). Therefore, if one were to do a
bisection search on each path connecting these oppositely labeled vertices, one can find all the
cut-edges using no more than $|C_{rs}| \log \kappa$ queries. If $\kappa$ is small, this quantity could be significantly
smaller than $n$ (the number of queries an exhaustive querying algorithm would make). The reason for the
drastically reduced query complexity is the fact that the cut-edges in $C_{rs}$ are tightly clustered. A
generalization of this observation gives rise to our third notion of complexity.

Let $H_r = (C, \mathcal{E})$ be a “meta graph” whose vertices are the cut-edges of $G$ and $\{e, e'\} \in \mathcal{E}$ iff
$\delta(e, e') \le r$, i.e., $H_r$ is the $r$-nearest neighbor graph of the cut-edges, where distances are defined
according to $\delta$. From the definition of $\delta$, it is clear that for any $r \in \mathbb{N}$, $H_r$ has at least $m$ connected
components.

Definition 1. A cut-set $C$ is said to be $\kappa$-clustered if $H_\kappa$ has exactly $m$ connected components.
These connected components correspond to the cut-components, and we will say that each individual
cut-component is also $\kappa$-clustered.

Turning back to Figure 2, observe that the cut component $C_{12}$ is $\kappa$-clustered for any $\kappa \ge 5$ in
Figure 2(a) and $\kappa$-clustered for any $\kappa \ge 12$ in Figure 2(b). The figure also shows a length 5 (resp.
12) path in Fig 2(a) (resp. Fig 2(b)) that defines the clusteredness.

Figure 2: Two graphs with the same number of cut-edges (thickened edges) but the cut component $C_{12}$ in
(a) is more “clustered” than its counterpart in (b). $C_{12}$ is 5-clustered in (a) and 12-clustered in (b). The
corresponding paths are shown with a (red) dotted line.
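As a rough illustration of the clustering parameter, the sketch below computes the smallest $\kappa$ for which a labeled graph is $\kappa$-clustered. It assumes that $\delta(e_1, e_2)$ is the sum of the two same-side shortest-path distances in $G - C$ plus one, and it uses a Kruskal-style union-find pass over the finite $\delta$ values; the helper `grid`, the function names and the convention of returning 1 when no two cut edges need to be joined are choices made for this example.

```python
from collections import deque
from itertools import combinations


def grid(rows, cols):
    """Hypothetical helper: adjacency dictionary of a rows x cols grid graph."""
    adj = {(r, c): set() for r in range(rows) for c in range(cols)}
    for r, c in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].add(nb)
                adj[nb].add((r, c))
    return adj


def same_label_distances(adj, f, src):
    """BFS distances from src in G - C (only edges joining equal labels)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if f[w] == f[u] and w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def min_kappa(adj, f):
    """Smallest kappa such that the cut set of (adj, f) is kappa-clustered."""
    # Orient each cut edge as (x, y) with f[x] = +1 and f[y] = -1.
    cut = sorted({(u, w) for u in adj for w in adj[u] if f[u] == 1 and f[w] == -1})
    dist = {v: same_label_distances(adj, f, v) for e in cut for v in e}
    pairs = []
    for (i, (x1, y1)), (j, (x2, y2)) in combinations(enumerate(cut), 2):
        if x2 in dist[x1] and y2 in dist[y1]:  # finite delta only
            pairs.append((dist[x1][x2] + dist[y1][y2] + 1, i, j))
    parent = list(range(len(cut)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kappa = 1
    for d, i, j in sorted(pairs):
        ri, rj = find(i), find(j)
        if ri != rj:  # this delta value is needed to connect a cut component
            parent[ri] = rj
            kappa = max(kappa, d)
    return kappa


if __name__ == "__main__":
    adj = grid(4, 4)
    f = {(r, c): 1 if c < 2 else -1 for r, c in adj}
    print(min_kappa(adj, f))
```

For a grid split in half by a straight boundary, neighbouring cut edges are at distance 1 + 1 + 1 = 3 from each other, so the sketch reports 3.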


4.2 The query complexity of S^2

In this section, we will prove the following theorem. 

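Theorem 1. Suppose that a graph $G = (V, E)$ and a binary function $f$ are such that the induced
cut set $C$ has $m$ cut components that are each $\kappa$-clustered. Then for any $\epsilon > 0$, S^2 will recover $C$
with probability at least $1 - \epsilon$ if the BUDGET is at least

$$\frac{\log(1/(\beta\epsilon))}{\log(1/(1-\beta))} + m\left(\lceil \log_2 n \rceil - \lceil \log_2 \kappa \rceil\right) + |\partial C|\left(\lceil \log_2 \kappa \rceil + 1\right). \qquad (1)$$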


Before we prove the theorem, we will make some remarks:

1. As observed in Afshani et al. [2007], using $O(|\partial C|)$ queries, we can create a greedy cover of
the graph and reduce the length of the longest binary search from $n$ to $O(n / |\partial C|)$. With this simple
modification to S^2, the $\log_2 n$ in (1) can be replaced by $\log(n / |\partial C|)$.

2. Suppose that $G$ is a $\sqrt{n} \times \sqrt{n}$ grid graph where one half of this graph, i.e., a $\sqrt{n} \times \sqrt{n}/2$
rectangular grid, is labeled $+1$ and the other half $-1$. In this case, since $m = 1$, $|\partial C| = O(\sqrt{n})$
and $\kappa = 3$, Theorem 1 says that S^2 needs to query at most $O(\sqrt{n} + \log n)$ vertices. This is much
smaller than the bound of $O(\sqrt{n} \log n)$ which can be obtained using the results of Afshani et al.
[2007].

In fact, when $\kappa < n / |\partial C|$, it is clear that for $m \ge 1$, $m \log(n / |\partial C|) + (|\partial C| - m) \log \kappa <
|\partial C| \log(n / |\partial C|)$, i.e., our bounds are strictly better than those in Afshani et al. [2007].^4

3. As discussed in Section 2, the number of queries the (i.i.d.) noise-tolerant version of S^2 submits
can be bounded by the above quantity times $\frac{1}{2(0.5 - \gamma)^2} \log\left(\frac{n}{\epsilon}\right)$, and it is guaranteed to recover $C$ exactly
with probability at least $1 - 2\epsilon$.

^4 The first term in (1) is due to random sampling and is present in both results. We have also ignored integer effects
in making this observation.

4. The first term in (1) is due to the random sampling phase, which can be quite large if $\beta$ is very
small. Intuitively, if $V_i$ is a very small connected component, then one might have to query almost
all vertices before one can even obtain a label from $V_i$. Therefore, the “balanced” situation, where
$\beta$ is a constant independent of $n$, is ideal here. These considerations come up in Afshani et al.
[2007] as well, and in fact their algorithm needs a priori knowledge of $\beta$. Such balancedness
assumptions arise in various other lines of work as well (e.g., Balakrishnan et al. [2011], Eriksson
et al. [2011]). Finally, we remark that if one is willing to ignore errors made on small “outlier”
components, then $\beta$ can be redefined in terms of only the sufficiently large $V_i$'s. This allows us to
readily generalize our results to the case where one can approximately recover the labeling of the
graph.

5. As we argue in Appendix D, S^2 is near optimal with respect to the complexity parametrization
introduced in this paper. That is, we show that there exists a family of graphs such that no algorithm
has significantly lower query complexity than S^2. It will be interesting to investigate the “instance-optimality”
property of S^2, where we fix a graph and then ask if S^2 is near optimal in discovering
labelings on this graph. We take a step in this direction in Appendix D.2 and show that for the
2-dimensional grid graph, the number of queries made by S^2 is near optimal.
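To get a feel for the size of the budget in Theorem 1, the following sketch evaluates the bound for given problem parameters. It assumes that the first term is the random-sampling term $\log(1/(\beta\epsilon)) / \log(1/(1-\beta))$, rounded up here to an integer, and that the remaining terms are $m(\lceil \log_2 n \rceil - \lceil \log_2 \kappa \rceil) + |\partial C|(\lceil \log_2 \kappa \rceil + 1)$; the function name and argument names are hypothetical.

```python
import math


def s2_budget(n, beta, eps, m, boundary_size, kappa):
    """Evaluate the Theorem 1 budget (random term rounded up to an integer)."""
    random_phase = math.ceil(math.log(1 / (beta * eps)) / math.log(1 / (1 - beta)))
    log_n = math.ceil(math.log2(n))
    log_kappa = math.ceil(math.log2(kappa))
    return random_phase + m * (log_n - log_kappa) + boundary_size * (log_kappa + 1)


if __name__ == "__main__":
    # 32 x 32 grid split in half: n = 1024, beta = 0.5, one cut component,
    # 32 cut edges (64 boundary vertices), kappa = 3.
    print(s2_budget(n=1024, beta=0.5, eps=0.05, m=1, boundary_size=64, kappa=3))
```

For this 32 x 32 example the budget is a small fraction of the 1024 vertices and is dominated by the term that is linear in the boundary size.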

Proof of Theorem 1. We will begin the proof with a definition and an observation. Recall (from
Section 4.1) that the labeling function partitions the vertex set into similarly labeled components
$V_1, \ldots, V_k$. Suppose that $W \subseteq V$ is such that for each $i \in \{1, 2, \ldots, k\}$, $|W \cap V_i| \ge 1$. Then, it
follows that for each $e \in C$, there exists a pair of vertices $v, v'$ in $W$ such that $f(v) \ne f(v')$ and $e$
lies on a path joining $v$ and $v'$. We call such a set a witness to the cutset $C$. Since $\partial C$ is a witness
to the cut-set $C$, it is clearly necessary for any algorithm to know the labels of a witness to the cut
set $C$ in order to learn the cut set.

Now, as described in Section 3, the operation of S^2 consists of two phases: (a) the random
sampling phase and (b) the aggressive search phase. Observe that if S^2 knows the labels of a
witness to the cut component $C_{rs}$, then, until it discovers the entire cut-component, S^2 remains in
the aggressive search phase. Therefore, the goal of the random sampling phase is to find a witness
to the cut set. This implies that we can bound from above the number of random queries S^2 needs
before it can locate a witness set, and this is done in the following lemma.

Lemma 1. Consider a graph $G = (V, E)$ and a labeling function $f$ with balancedness $\beta$. For all
$\alpha > 0$, a subset $L$ chosen uniformly at random is a witness to the cutset with probability at least
$1 - \alpha$, as long as

$$|L| \ge \frac{\log(1/(\beta\alpha))}{\log(1/(1-\beta))}.$$

We refer the reader to Appendix B for the proof, which is a straightforward combinatorial
argument.
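One way to see why a sample size of this form is plausible is the following union-bound sketch. It assumes that the vertices of $L$ are drawn independently and uniformly with replacement and that $\beta$ is the smallest component fraction $\min_i |V_i|/n$, so that $k \le 1/\beta$; the proof in Appendix B may proceed differently.

```latex
\begin{align*}
\Pr[L \cap V_i = \emptyset]
  &= \Bigl(1 - \tfrac{|V_i|}{n}\Bigr)^{|L|} \le (1-\beta)^{|L|}, \\
\Pr[L \text{ is not a witness}]
  &\le \sum_{i=1}^{k} \Pr[L \cap V_i = \emptyset]
   \le k\,(1-\beta)^{|L|}
   \le \tfrac{1}{\beta}\,(1-\beta)^{|L|}, \\
\tfrac{1}{\beta}\,(1-\beta)^{|L|} \le \alpha
  &\iff |L| \ge \frac{\log\bigl(1/(\beta\alpha)\bigr)}{\log\bigl(1/(1-\beta)\bigr)}.
\end{align*}
```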

Of course, since the random sampling phase and the aggressive search phase are interleaved,
S^2 might end up needing far fewer random queries before a witness $W$ to the cut-set has been
identified.

Now, we will turn our attention to the aggressive search phase. The algorithm enters the aggressive
search phase as soon as there are oppositely labeled vertices that are connected (recall that the
algorithm removes any cut-edge that it finds). Let $\ell_{S^2}(G, L, f)$ be the length of the shortest among
all shortest paths connecting oppositely labeled vertices in $L$, and we will suppress this quantity's
dependence on $G, L, f$ when the context is clear. After each step of the aggressive search phase,
the shortest shortest path that connects oppositely labeled vertices gets bisected. Suppose that, during
this phase, the current shortest path length is $\ell_{S^2} = \ell \in \mathbb{N}$. Then depending on its parity, after
one step $\ell_{S^2}$ gets updated to $\frac{\ell + 1}{2}$ (if $\ell$ is odd) or $\frac{\ell}{2}$ (if $\ell$ is even). Therefore, we have the following
result.

Claim 1. If $\ell_{S^2} = \ell$, after no more than $r = \lceil \log_2 \ell \rceil + 1$ aggressive steps, a cut-edge is found.

Proof. First observe that after $r$ steps of the aggressive search phase, the length of the shortest
shortest path is no more than [...]. Next, observe that if $r \ge 4$, then $2^{r-1} < 2^r - 2r + 1$.

The proof of this claim then follows from the following observation. Let us first suppose that
$\ell \ge 8$, then setting $r = \lceil \log_2 \ell \rceil + 1$, we have that $\ell \le 2^{r-1} < 2^r - 2r + 1$, which of course implies
that [...] $< 1$. Therefore, after at most $r$ steps, the current shortest shortest path length drops to
below 1, i.e., a cut-edge will be found. For the case when $\ell < 8$, one can exhaustively check that
this statement is true. □
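The halving behaviour behind Claim 1 can also be checked numerically. The sketch below assumes the worst-case update described above (a path of length $\ell$ becomes one of length $(\ell+1)/2$ or $\ell/2$) applied repeatedly to a single path until its length is 1, at which point its two end points are adjacent and oppositely labeled, i.e., a cut-edge has been found. The function names are hypothetical, and the check ignores the fact that S^2 may switch between paths, which can only shorten the search.

```python
def steps_until_cut_edge(length):
    """Number of bisections until the path length reaches 1 under the
    update l -> (l + 1) / 2 for odd l and l -> l / 2 for even l."""
    steps = 0
    while length > 1:
        length = (length + 1) // 2
        steps += 1
    return steps


def claim_bound(length):
    """ceil(log2(length)) + 1, computed with exact integer arithmetic."""
    return (length - 1).bit_length() + 1


if __name__ == "__main__":
    worst = max(range(2, 10001), key=lambda l: steps_until_cut_edge(l) - claim_bound(l))
    print(worst, steps_until_cut_edge(worst), claim_bound(worst))
```

Under this simplified model the number of bisections never exceeds the bound stated in Claim 1 for any starting length up to 10000.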

It is instructive to observe two things here: (a) even though $\ell_{S^2}$ gets (nearly) halved after every
such aggressive step, it might not correspond to the length of a single path through the span of the
above $r$ steps, and (b) at the end of $r$ aggressive steps as above, at least one new boundary vertex
is found; therefore, the algorithm might end up uncovering multiple cut-edges after $r$ steps.

To bound the number of active queries needed, let us split up the aggressive search phase
of S^2 into “runs”, where each run ends when a new boundary vertex has been discovered, and
commences either after the random sampling phase exposes a new path that connects oppositely
labeled vertices or when the previous run ends in a new boundary vertex being discovered. Let $R$
be the total number of runs. Notice that $R \le |\partial C|$. For each $i \in R$, let $G_i$ and $L_i$ denote the graph
and the label set at the start of run $i$. Therefore, by Claim 1, the total number of active learning
queries needed can be upper bounded by $\sum_{i \in R} \left\{ \lceil \log_2(\ell_{S^2}(G_i, L_i, f)) \rceil + 1 \right\}$.

Now, observe that for each run $i \in R$, it trivially holds that $\ell_{S^2}(G_i, L_i, f) \leq n$. (As in
Remark 1 after the statement of the theorem, this can be improved to $O(n/|\partial C|)$ using a greedy
cover of the graph [Afshani et al., 2007].) Now suppose that one cut-edge is discovered in a
cut-component $C_{rs}$. Since the graph on $C_{rs}$ is $\kappa$-connected by the assumption of the theorem,
until all the cut-edges are discovered, there exists at least one undiscovered cut-edge in $C_{rs}$ that
is at most $\kappa$ away (in the sense of the distance of Definition 1) from one of the discovered
cut-edges. Therefore, $\ell_{S^2}$ is no more than $\kappa$ until all cuts in $C_{rs}$ have been discovered. In other
words, for $|C_{rs}| - 1$ runs in $R$, $\ell_{S^2} \leq \kappa$.

Reasoning similarly for each of the $m$ cut components, we have that there are no more than
$m$ “long” runs (one per cut-component), and we will bound $\ell_{S^2}$ for these runs by $n$. From the
argument above, after an edge from a cut-component has been found, $\ell_{S^2}$ is never more than $\kappa$ till
all the edges of the cut-component are discovered. Therefore, we have the following upper bound
on the number of aggressive queries needed.

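\begin{align*}
\sum_{i \in R} \left\{ \lceil \log_2(\ell_{S^2}(G_i, L_i, f)) \rceil + 1 \right\}
  &\leq |\partial C| + m \lceil \log_2 n \rceil + (|\partial C| - m) \lceil \log_2 \kappa \rceil \\
  &= m \left( \lceil \log_2 n \rceil - \lceil \log_2 \kappa \rceil \right) + |\partial C| \left( \lceil \log_2 \kappa \rceil + 1 \right)
\end{align*}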

Putting this together with the upper bound on the number of random queries needed, we get the 
desired result. □ 
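
As a quick sanity check on the algebra above, the following sketch evaluates both forms of the bound on the number of aggressive queries. The function names and the example parameter values are illustrative assumptions; the bound on the random sampling phase is not included.

```python
def ceil_log2(x: int) -> int:
    """Exact ceil(log2(x)) for a positive integer x."""
    return (x - 1).bit_length()


def aggressive_bound_expanded(n: int, kappa: int, m: int, boundary: int) -> int:
    """|dC| + m*ceil(log2 n) + (|dC| - m)*ceil(log2 kappa)."""
    return boundary + m * ceil_log2(n) + (boundary - m) * ceil_log2(kappa)


def aggressive_bound(n: int, kappa: int, m: int, boundary: int) -> int:
    """m*(ceil(log2 n) - ceil(log2 kappa)) + |dC|*(ceil(log2 kappa) + 1)."""
    return m * (ceil_log2(n) - ceil_log2(kappa)) + boundary * (ceil_log2(kappa) + 1)


if __name__ == "__main__":
    # Hypothetical values: n = 225 vertices, kappa = 3, one cut component, |dC| = 68.
    print(aggressive_bound(225, 3, 1, 68), aggressive_bound_expanded(225, 3, 1, 68))
```

The two expressions agree for every choice of parameters, which is the rearrangement used in the last step of the display.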

5 $S^2$ for Nonparametric Learning

Significant progress has been made in terms of characterizing the theoretical advantages of active 
learning in nonparametric settings. For instance, minimax rates of convergence in active learning 
have been characterized under the so called boundary fragment assumption [Castro and Nowak, 
2008, Minsker, 2012, Wang, 2011]. This model requires the Bayes decision boundary to have a 
functional form. For example, if the feature space is $[0,1]^d$, then the boundary fragment model
assumes that the Bayes decision boundary is defined by a curve of the form $x_d = f(x_1, \dots, x_{d-1})$,
for some (smooth) function $f$. While such assumptions have proved useful for theoretical analysis,
they are unrealistic in practice. Nonparametric active learning has also been analyzed in terms of 
abstract concepts such as bracketing and covering numbers [Hanneke, 2011, Koltchinskii, 2010], 
but it can be difficult to apply these tools in practice as well. The algorithm proposed in Zhu et al. 
[2003b] offers a flexible approach to nonparametric active learning that appears to provide good 
results in practice, but comes with no theoretical performance guarantees. A contribution of our 
paper is to fill the gap between the practical method of Zhu et al. [2003b] and the theoretical work 
above. The $S^2$ algorithm can adapt to nonparametric decision boundaries without the unrealistic
boundary fragment assumption required by Castro and Nowak [2008], for example. We show that
$S^2$ achieves the minimax rate of convergence for classification problems with decision boundaries
in the so-called box-counting class, which is far less restrictive than the boundary fragment model. 
To the best of our knowledge this is the first practical algorithm that is near minimax-optimal for 
this class of problems. 

5.1 Box-Counting Class 

Consider a binary classification problem on the feature space $[0,1]^d$, $d > 1$. The box-counting
class of decision boundaries generalizes the set of boundary fragments with Lipschitz regularity to
sets with arbitrary orientations, piecewise smoothness, and multiple connected components. Thus,
it is a more realistic assumption than boundary fragments for classification; see Scott and Nowak
[2006] for more details. Let $w$ be an integer and let $P_w$ denote the regular partition of $[0,1]^d$ into
hypercubes of side length $1/w$. Every classification rule can be specified by a set $B \subset [0,1]^d$ on
which it predicts $+1$. Let $N_w(B)$ be the number of cells in $P_w$ that intersect the boundary of $B$,
denoted by $\partial B$. For $c_1 > 0$, define the box-counting class $\mathcal{B}_{box}(c_1)$ as the collection of all sets $B$
such that $N_w(B) \leq c_1 w^{d-1}$ for all (sufficiently large) $w$.
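
To make the box-counting condition concrete, the sketch below counts the cells of $P_w$ that intersect the boundary of a disc in $[0,1]^2$. The disc (center and radius) is a hypothetical example set, not one taken from the paper. For a circle the test is exact: a closed cell meets the circle when the radius lies between the distances from the center to the nearest and the farthest point of the cell.

```python
import math


def boundary_cell_count(w: int, center=(0.5, 0.5), radius=0.3) -> int:
    """Number of cells of the w x w partition of [0,1]^2 that meet the circle."""
    cx, cy = center
    count = 0
    for i in range(w):
        for j in range(w):
            x0, x1 = i / w, (i + 1) / w
            y0, y1 = j / w, (j + 1) / w
            # Distance from the center to the nearest point of the closed cell.
            near = math.hypot(max(x0 - cx, 0.0, cx - x1), max(y0 - cy, 0.0, cy - y1))
            # Distance from the center to the farthest corner of the cell.
            far = math.hypot(max(abs(x0 - cx), abs(x1 - cx)),
                             max(abs(y0 - cy), abs(y1 - cy)))
            if near <= radius <= far:
                count += 1
    return count


if __name__ == "__main__":
    for w in (10, 20, 40, 80):
        n_w = boundary_cell_count(w)
        print(w, n_w, n_w / w)  # the ratio N_w / w^(d-1) should stay bounded (d = 2)
```

For $d = 2$ the count grows linearly in $w$, so this disc belongs to the box-counting class for a suitable constant $c_1$.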

5.2 Problem Set-up

Consider the active learning problem under the following assumptions. 

A1 The Bayes optimal classifier $B_*$ resides in $\mathcal{B}_{box}(c_1)$. The corresponding boundary $\partial B_*$
divides $[0,1]^d$ into $k$ connected components (each labeled either $+1$ or $-1$), each with
volume at least $0 < \beta < 1$. Furthermore, the Hausdorff distance between any two
components with the same label is at least $\Delta_1 > 0$.

A2 The marginal distribution of features $P(X = x)$ is uniform over $[0,1]^d$ (the results can be
generalized to continuous distributions bounded away from zero).

A3 The conditional distribution of the label at each feature is bounded away from 1/2 by a
positive margin; i.e., $|P(Y = 1 | X = x) - 1/2| \geq \gamma > 0$ for all $x \in [0,1]^d$.

It can be checked that the set of distributions that satisfy A1-A3 contains the set of distributions
$\mathrm{BF}(1, 1, c_1, 0, \gamma)$ from Castro and Nowak [2008].

Let $G$ denote the regular square lattice on $[0,1]^d$ with $w$ vertices equispaced in each coordinate
direction. Each vertex is associated with the center of a cell in the partition $P_w$ described above.
Figure 3 depicts a case where $d = 2$, $w = 15$ and $k = 2$, where the subset of vertices contained
in the set $B_*$ is indicated in red. The minimum distance $\Delta_1$ in A1 above ensures that, for $w$
sufficiently large, the lattice graph also consists of exactly $k$ connected components of vertices
having the same label. $S^2$ can be used to solve the active learning problem by applying it to the
lattice graph associated with the partition $P_w$. In this case, when $S^2$ requests a (noisy) label of a
vertex of $G$, we randomly draw a feature from the cell corresponding to this vertex and return its
label.
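
The reduction described above can be sketched as follows for $d = 2$. This is an illustration only: `bayes_label`, `flip_prob` and the disc used in the usage example are hypothetical names and choices, and the noise is modeled as an independent label flip with probability `flip_prob` below 1/2.

```python
import random


def lattice_graph(w: int):
    """Adjacency lists of the w x w square lattice; vertex (i, j) is a cell of P_w."""
    adj = {(i, j): [] for i in range(w) for j in range(w)}
    for (i, j) in adj:
        for di, dj in ((1, 0), (0, 1)):
            nb = (i + di, j + dj)
            if nb in adj:
                adj[(i, j)].append(nb)
                adj[nb].append((i, j))
    return adj


def make_oracle(w: int, bayes_label, flip_prob: float, rng: random.Random):
    """Noisy label oracle: draw a feature uniformly from the queried cell."""
    def oracle(vertex):
        i, j = vertex
        x = ((i + rng.random()) / w, (j + rng.random()) / w)  # uniform in the cell
        y = bayes_label(x)
        return -y if rng.random() < flip_prob else y  # assumed noise model
    return oracle


if __name__ == "__main__":
    # Hypothetical Bayes classifier: +1 inside a disc of radius 0.3, -1 outside.
    disc = lambda x: 1 if (x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2 <= 0.09 else -1
    graph = lattice_graph(15)
    oracle = make_oracle(15, disc, flip_prob=0.1, rng=random.Random(0))
    print(len(graph), oracle((7, 7)))
```

Any graph-based learner that takes an adjacency structure and a label oracle can then be run on `graph` and `oracle` unchanged.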

Near Minimax Optimality of $S^2$.

The main result of this section shows that the excess risk of $S^2$ for the nonparametric active learning
problem described above is minimax-optimal (up to a logarithmic factor). Recall that the excess
risk is the difference between the probability of error of the $S^2$ estimator and the Bayes error rate.
The active learning minimax lower bound for excess risk is given by $n^{-1/(d-1)}$ [Castro and Nowak,
2008], a significant improvement upon the passive learning bound of $n^{-1/d}$ [Scott and Nowak,
2006]. Previous algorithms (nearly) achieving this rate required the boundary fragment assumption
[Castro and Nowak, 2008], and so $S^2$ is near-optimal for a much larger and more realistic range of
problems.

Theorem 2. For any classification problem satisfying conditions A1-A3, there exists a constant
$C(k, \beta, \Delta_1, \gamma) > 0$ such that the excess risk achieved by $S^2$ with $n$ samples on this problem is no
more than $C \left( \frac{\log n}{n} \right)^{1/(d-1)}$, for $n$ large enough.

(In A1, connectedness is with respect to the usual topology.)

                              Mean $\partial C$-query complexity (10 trials)
Graph : (n, |C|, |$\partial C$|)       $S^2$     AFS     ZLG     BND
Grid  : (225, 32, 68)         88.8    160.4   91      192
1 v 2 : (400, 99, 92)         96.4    223.2   102.6   370.2
4 v 9 : (400, 457, 290)       290.6   367.2   292.3   362.4
CVR   : (380, 530, 234)       235.8   332.1   236.2   371.1

Table 1: Performance of $S^2$, AFS, ZLG, BND.


To prove this theorem, we use Theorem 1 to bound the number of samples required by $S^2$ to be
certain (with high probability) of the classification rule everywhere but the cells that are intersected
by the boundary $\partial B_*$. We then use the regularity assumptions we make on the distribution of
the features and the complexity of the boundary to arrive at a bound on the excess risk of the
$S^2$ algorithm. This allows us to estimate the excess risk as a function of the number of samples
obtained. Refer to Appendix C for the details of the proof.

Figure 3: 15 x 15 grid graph used in experiments. The vertices with the red circle indicate the positive class.

6 Experiments

We performed some preliminary experiments on the following data sets: 

(a) Digits: This dataset is from the Cedar Buffalo binary digits database, originally from Hull [1994].
We preprocessed the digits by reducing the size of each image down to a 16x16 grid with
down-sampling and Gaussian smoothing [Le Cun et al., 1990]. Each image is thus a 256-dimensional
vector with elements being gray-scale pixel values in 0-255. We considered two separate binary
classification tasks on this data set: 1 vs 2 and 4 vs 9. Intuitively one might expect the former task
to be much simpler than the latter. For each task, we randomly chose 200 digits in the positive
class and 200 in the negative. We computed the Euclidean distance between these 400 digits
based on their feature vectors. We then constructed a symmetrized 10-nearest-neighbor graph,
with an edge between images $i$ and $j$ iff $i$ is among $j$'s 10 nearest neighbors or vice versa. Each
task is thus represented by a graph with exactly 400 nodes and about 3000 undirected unweighted
edges. Nonetheless, due to the intrinsic confusability, the cut size and the boundary (i.e., edges
connecting the two classes) vary drastically across the tasks: 1 vs 2 has a boundary of 92, while
4 vs 9 has a boundary of 290.
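
The graph construction described in (a) can be sketched in a few lines. This is a plain-Python illustration of a symmetrized k-nearest-neighbor graph under Euclidean distance, not the preprocessing code used for the experiments; the function name and the toy points are assumptions.

```python
import math


def symmetrized_knn_graph(points, k):
    """Undirected edge set: {i, j} is an edge iff i is among j's k nearest
    neighbors or j is among i's k nearest neighbors (Euclidean distance)."""
    n = len(points)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return edges


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.5,), (10.0,)]
    print(sorted(symmetrized_knn_graph(pts, k=1)))
```

With $n$ points and $k$ neighbors per point, the symmetrized graph has between $nk/2$ and $nk$ edges, which is consistent with roughly 3000 edges for 400 images and $k = 10$.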

(b) Congressional Voting Records (CVR): This is the congressional voting records data set from 
the UCI machine learning repository [Bache and Lichman, 2013]. We created a graph out of this 
by thresholding (at 0.5) the Euclidean distance between the data points. This was then processed 
to retain the largest connected component which had 380 vertices and a boundary size of 234. 

(c) Grid: This is a synthetic example of a 15x15 grid of vertices with a positive core in the center.
The core was generated from a square by randomly dithering its boundary. See Figure 3.

We compared the performance of four algorithms: (a) $S^2$; (b) AFS - the active learning
algorithm from Afshani et al. [2007]; (c) ZLG - the algorithm from Zhu et al. [2003b]; and (d) BND
- the experiment design-like algorithm from Gu and Han [2012].

We show the number of queries needed before all nodes in $\partial C$ have been queried. This
number, which we call $\partial C$-query complexity, is by definition no smaller than $|\partial C|$. Notice that before
completely querying $\partial C$, it is impossible for any algorithm to guarantee zero error without prior
assumptions. Thus we posit that $\partial C$-query complexity is a sensible measure for the setting
considered in this paper. In fact, $\partial C$-query complexity can be thought of as the experimental analogue of
the theoretical query complexity of Section 4. These results are shown in Table 1. In each
experiment, the best performance is the lowest mean query count in the corresponding row. As can be
seen, $S^2$ clearly outperforms AFS and BND, as suggested by our theory. It is quite surprising to see
how well ZLG performs given that it was not designed with this objective in mind. We believe that
trying to understand this will be a fruitful avenue for future work.
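
The $\partial C$-query complexity measure is simple to compute from a query log. The sketch below is illustrative and assumes that `query_order` is the sequence of vertices queried by some algorithm and that the boundary consists of every vertex incident to a cut-edge; the function names are hypothetical.

```python
def boundary_vertices(adj, f):
    """Vertices that have at least one neighbor with the opposite label."""
    return {u for u in adj for v in adj[u] if f[u] != f[v]}


def boundary_query_complexity(query_order, boundary):
    """Number of queries issued before every boundary vertex has been queried.

    Returns None if the query sequence never covers the whole boundary.
    """
    remaining = set(boundary)
    if not remaining:
        return 0
    for count, v in enumerate(query_order, start=1):
        remaining.discard(v)
        if not remaining:
            return count
    return None


if __name__ == "__main__":
    # Path graph 0-1-2-3-4 with labels +,+,-,-,-: the boundary is {1, 2}.
    adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 4] for i in range(5)}
    f = {0: 1, 1: 1, 2: -1, 3: -1, 4: -1}
    print(boundary_query_complexity([0, 2, 4, 1, 3], boundary_vertices(adj, f)))
```

By construction the returned value is never smaller than the size of the boundary.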

A Proof of Proposition 1


Proposition 1. Suppose $\mathcal{A}$ is an algorithm that has access to $f$ through a noiseless oracle (i.e.,
a 0-noisy oracle), and suppose that it has an $\varepsilon$-query complexity of $q$. Then for each $\gamma \in (0, 0.5)$,
there exists an algorithm $\tilde{\mathcal{A}}$ which, using a $\gamma$-noisy oracle, achieves a $2\varepsilon$-query complexity given
by $q \times \frac{1}{2(0.5-\gamma)^2} \log\left(\frac{n}{\varepsilon}\right)$.

Proof. Given a $\gamma > 0$, one can design $\tilde{\mathcal{A}}$ as follows. $\tilde{\mathcal{A}}$ simply runs $\mathcal{A}$ as a sub-routine. Suppose
$\mathcal{A}$ requires the label of a vertex $v \in V$ to proceed; $\tilde{\mathcal{A}}$ intercepts this label request and repeatedly
queries the $\gamma$-noisy oracle $r$ (as defined above) times for the label of $v$. It then returns the majority
vote $\hat{f}(v)$ as the label of $v$ to $\mathcal{A}$. The probability that such an algorithm fails can be bounded as
follows.

\begin{align}
P\left[\tilde{\mathcal{A}} \text{ fails after } rq \text{ queries}\right]
  &\leq P\left[\exists v \in V \text{ s.t. } \hat{f}(v) \neq f(v)\right] \tag{2} \\
  &\quad + P\left[\mathcal{A} \text{ fails after } q \text{ queries} \,\middle|\, \hat{f}(v) = f(v), \forall v \in V\right] \tag{3} \\
  &\overset{(a)}{\leq} n \times P\left[\hat{f}(v) \neq f(v)\right] + \varepsilon \tag{4} \\
  &\overset{(b)}{\leq} n \times e^{-2r(0.5-\gamma)^2} + \varepsilon, \tag{5}
\end{align}

where (a) follows from the union bound and the fact that $\mathcal{A}$ has an $\varepsilon$-query complexity of $q$. (b)
follows from applying the Chernoff bound [Chernoff, 1952] to the majority voting procedure:
$P[\hat{f}(v) \neq f(v)] = P[\mathrm{Bin}(r, \gamma) \geq 0.5 \times r] \leq e^{-2r(0.5-\gamma)^2}$. Therefore, if we set $r$ as in the
statement of the proposition, we get the desired result. □
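
The majority-vote wrapper used in the proof can be sketched as follows. This is a hedged illustration: it assumes labels in {-1, +1}, it interprets $\gamma$ as the probability that a single noisy answer is wrong (as in the binomial bound above), and it rounds the number of repetitions up to an integer. The function names are hypothetical.

```python
import math


def repetitions(n: int, eps: float, gamma: float) -> int:
    """Smallest integer r with n * exp(-2 r (0.5 - gamma)^2) <= eps."""
    return math.ceil(math.log(n / eps) / (2.0 * (0.5 - gamma) ** 2))


def majority_vote(noisy_oracle, v, r: int) -> int:
    """Query the noisy oracle r times for vertex v and return the majority label."""
    votes = sum(noisy_oracle(v) for _ in range(r))
    return 1 if votes > 0 else -1  # a tie (possible only for even r) maps to -1


if __name__ == "__main__":
    r = repetitions(n=100, eps=0.05, gamma=0.25)
    print(r, 100 * math.exp(-2 * r * (0.5 - 0.25) ** 2))
```

With this choice of `r`, the union bound over all $n$ vertices keeps the probability of any incorrect majority vote below $\varepsilon$.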


B Proof of Lemma 1 

Lemma 1. Consider a graph $G = (V, E)$ and a labeling function $f$ with balancedness $\beta$. For all
$\alpha > 0$, a subset $L$ chosen uniformly at random is a witness to the cutset with probability at least
$1 - \alpha$, as long as
\[ |L| \geq \frac{\log(1/(\beta\alpha))}{\log(1/(1-\beta))}. \]

Proof. The smallest component in $V$ is of size at least $\beta n$. Let $\mathcal{E}$ denote the event that there exists
a component $V_i$ such that $V_i \cap L = \emptyset$. Then, using the union bound and ignoring
integer effects, we have $P[\mathcal{E}] \leq \frac{1}{\beta} (1-\beta)^{|L|}$, since there are at most $1/\beta$ components and
each of them contains at least a $\beta$ fraction of the vertices (recall that $\beta < 1$).

To conclude, we observe that if we pick $|L|$ as stated in the lemma, then the right-hand side of
the above equation drops below $\alpha$. This concludes the proof. □
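
The sample size in Lemma 1 is easy to evaluate. The sketch below computes it and checks it against the union bound $\frac{1}{\beta}(1-\beta)^{|L|}$; the function name and the example values of $\beta$ and $\alpha$ are illustrative assumptions.

```python
import math


def witness_sample_size(beta: float, alpha: float) -> int:
    """Smallest integer |L| with |L| >= log(1/(beta*alpha)) / log(1/(1-beta))."""
    return math.ceil(math.log(1.0 / (beta * alpha)) / math.log(1.0 / (1.0 - beta)))


def miss_probability_bound(beta: float, size: int) -> float:
    """Union bound on missing some component: (1/beta) * (1 - beta)^size."""
    return (1.0 / beta) * (1.0 - beta) ** size


if __name__ == "__main__":
    size = witness_sample_size(beta=0.1, alpha=0.05)
    print(size, miss_probability_bound(0.1, size))
```

For $\beta = 0.1$ and $\alpha = 0.05$, about fifty random queries are enough to hit every component with probability at least 0.95.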


C Proof of Theorem 2 

Recall that we propose to run $S^2$ on the lattice graph $G$ corresponding to the partition $P_w$ of $[0,1]^d$.
And, recall that when $S^2$ requests a noisy sample of a vertex in $G$, a random feature is drawn from
the cell that corresponds to this vertex and its label is returned. Assumption A3 therefore implies
that $S^2$ has access to a $\gamma$-noisy oracle. (To be precise, $S^2$ has access to a $\gamma$-noisy oracle only when
querying a vertex whose cell in $P_w$ does not intersect the boundary $\partial B_*$. However, in the sequel
we will assume that $S^2$ fails to correctly predict the label of any vertex whose cell intersects the
boundary, and therefore, this detail does not affect the analysis.) In what follows, we will derive
bounds on the $\varepsilon$-query complexity of $S^2$ in this setting assuming it has access to a noiseless oracle.
Then, arguing as in Proposition 1, we can repeat each such query requested by $S^2$ a total of
$\frac{\log(w^d/\varepsilon)}{2(0.5-\gamma)^2}$ times and take a majority vote in order to get the $2\varepsilon$-query complexity for a
$\gamma$-noisy oracle. So, in the sequel we will assume that we have access to a noise free oracle.

We assume that $w$ is sufficiently large so that $w \geq \max\{\beta^{-1/d}, 2\Delta_1^{-1}\}$. The first condition is
needed since each homogeneously labeled component of the problem corresponds to at least $\beta w^d$
vertices of the lattice graph $G$, and the second condition ensures that there are exactly $k$ connected
components in $G$.

First, we observe that by Lemma 1, if $S^2$ randomly queries at least $\log(1/(\beta\varepsilon)) / \log(1/(1-\beta))$
vertices, then with probability greater than $1 - \varepsilon$ it will discover at least one vertex in each of
the $k$ connected components. Next, we observe that since there are $k$ connected components of
vertices, there are no more than $k^2/4$ cut components in the cut-set. As described in the proof
of Theorem 1, once it knows a witness to the cut set, the number of queries made by $S^2$ can be
bounded by adding the number of queries it makes to first discover one cut-edge in each of
these cut components to the number of queries needed to perform local search in each of the cut
components. Reasoning as in the proof of Theorem 1, for each cut component, $S^2$ requires no more
than $\log w^d$ queries to find one cut edge. To bound the number of queries needed for the local
search, we need to bound the size of the boundary $|\partial C|$ and $\kappa$, the clusteredness parameter (see
Definition 1) of the cut set in $G$. Towards this end, observe that by assumption A1, there are at
most $c_1 w^{d-1}$ cells of the partition $P_w$ that intersect $\partial B_*$. Since each cell has at most $d 2^{d-1}$
edges, we can conclude that $|\partial C| \leq 2 c_1 d (2w)^{d-1}$. To bound $\kappa$, let us fix a cut component and note
that given one cut edge, there must exist at least one other cut edge on a two dimensional face of the
hypercube containing the first cut edge. Furthermore, this cut edge is contained on a path of length
3 between the vertices of the first cut edge. Since the boundaries of the homogeneously labeled
components in $[0,1]^d$ are continuous (by definition), it is easy to see that there is an ordering of
cut-edges in this cut component $e_1, e_2, \dots$ such that the distances between them, per Definition 1,
satisfy $\delta(e_i, e_{i+1}) = 3$. Since this holds true for all the cut components, this implies that the cut-set
is 3-clustered, i.e., $\kappa = 3$. Therefore, the complete boundary can be determined by labeling no
more than $(\lceil \log_2 \kappa \rceil + 1) |\partial C| \leq 6 c_1 d (2w)^{d-1}$ vertices (since $\lceil \log_2 3 \rceil = 2$). As observed earlier,
if we repeat each of the above queries $\frac{\log(w^d/\varepsilon)}{2(0.5-\gamma)^2}$ times, we can see that if the number of
queries made by $S^2$ satisfies

\begin{equation}
n \geq \left( 6 c_1 d (2w)^{d-1} + \frac{k^2}{4} \log w^d + \frac{\log(1/(\beta\varepsilon))}{\log(1/(1-\beta))} \right) \frac{\log(w^d/\varepsilon)}{2(0.5-\gamma)^2}, \tag{6}
\end{equation}

one can ensure that with probability at least $1 - 2\varepsilon$, $S^2$ will only possibly make mistakes on the
boundary of the Bayes optimal classifier $\partial B_*$. Let $\mathcal{E}$ be this event, and therefore $P[\mathcal{E}^c] \leq 2\varepsilon$.

(On the bound of $k^2/4$ cut components used above: suppose there are $z_1$ components of label $+1$
and $z_2$ components of label $-1$; then there are at most $z_1 z_2$ cut components. The maximum value
this can take when $z_1 + z_2 = k$ is $k^2/4$, by the arithmetic mean - geometric mean inequality.)

If we let $S^2(X)$ and $B_*(X)$ denote the prediction of a feature $X$ by $S^2$ and the Bayes optimal
classifier respectively, observe that the excess risk $\mathrm{ER}[S^2]$ of $S^2$ satisfies

\begin{align}
\mathrm{ER}[S^2] &= P\left[S^2(X) \neq Y\right] - P\left[B_*(X) \neq Y\right] \tag{7} \\
  &\overset{(a)}{\leq} P[\mathcal{E}^c] + P\left[S^2(X) \neq Y \,\middle|\, \mathcal{E}\right] - P\left[B_*(X) \neq Y \,\middle|\, \mathcal{E}\right] \tag{8} \\
  &\overset{(b)}{\leq} 2\varepsilon + \sum_{p \in P_w \cap \partial B_*} P[X \in p] \tag{9} \\
  &\leq \min\{2\varepsilon + c_1 2^d w^{-1}, 1\}, \tag{10}
\end{align}

where (a) follows from conditioning on $\mathcal{E}$. (b) follows from observing first that conditioned on $\mathcal{E}$,
$S^2(X)$ agrees with $B_*(X)$ everywhere except on the cells of $P_w$ that intersect the Bayes optimal
boundary $\partial B_*$, and that on this set, we can bound $P[S^2(X) \neq Y | \mathcal{E}] - P[B_*(X) \neq Y | \mathcal{E}] \leq 1$. The
last step follows since by assumption A1, there are at most $2 c_1 (2w)^{d-1}$ vertices on the boundary,
and since, by assumption A2, the probability of a randomly generated feature $X$ belonging to a
specific boundary cell is $w^{-d}$. Therefore, by taking $\varepsilon = 1/w$ we have that the excess risk of $S^2$ is
bounded from above by $(2 + c_1 2^d) w^{-1}$, and for this choice of $\varepsilon$, the number of samples $n$ satisfies


\begin{equation}
n \geq \left( 6 c_1 d (2w)^{d-1} + \frac{k^2}{4} \log w^d + \frac{\log(w/\beta)}{\log(1/(1-\beta))} \right) \frac{\log(w^{d+1})}{2(0.5-\gamma)^2}. \tag{11}
\end{equation}


Finally, we conclude the proof by observing that if $n$ satisfies (11) and if $w$ is sufficiently large, there
exists a constant $C$ which depends only on $c_1, k, \beta, \gamma, d$ such that $C \left( \frac{\log n}{n} \right)^{1/(d-1)} \geq (2 + c_1 2^d) w^{-1}$.


D The Tightness of $S^2$

We will now argue that the upper bounds we derived in the main paper are tight. Towards this end, 
in what follows, we will assume that a witness (cf. Section 4) to the cut set is known. This allows 
us to separate the effect of the random sampling phase and the learning phase in such problems. 

D.1 Parameter optimality of $S^2$

In this section, we will show that $S^2$ is near optimal with respect to the complexity parametrization
introduced in Section 4.1. In particular, we will show that given particular values for $n$, $\kappa$, $m$ and
$|\partial C|$, there exists a graph such that no algorithm can significantly improve upon $S^2$. For what
follows, we will set $c = |\partial C|$. Let us define

\[ \mathcal{V} = \left\{ (n, c, m, \kappa) \in \mathbb{N}^4 : m \leq c,\ c(\kappa + 2) \leq n \right\}. \]

We will show that as long as the given problem parameters are in $\mathcal{V}$, $S^2$ is near optimal. While it is
trivially true that the number of cut components, $m$, has to be no more than $c$, the second condition
places finer restrictions on the parameter values. These conditions are specific to the construction
we present below and can be significantly weakened at the expense of clarity of presentation.

We will now prove the following theorem.

Figure 4: $G(4, 9)$, i.e., $r = 4$, $k = 9$

Figure 5: $p = 3$ copies of $G(4, 9)$ linked together


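Theorem 3. Given a set of values $(n, c, m, \kappa) \in \mathcal{V}$, there exists a graph $G$ on $n$ vertices and a
set of labelings $\mathcal{F}$ on these vertices such that each $f \in \mathcal{F}$ satisfies:

• $f$ induces no more than $c$ cuts in the graph.

• The number of cut-components is $m$.

• Each component is $\kappa$-clustered.

Furthermore, $\log|\mathcal{F}|$ is no smaller than

\[
m \log\left( \frac{1}{m} \left\lfloor \frac{n}{\left\lfloor \frac{c}{m} \right\rfloor \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 2} \right\rfloor \times \left( \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 1 \right) \right)
+ \left( m \left\lfloor \frac{c}{m} \right\rfloor - m \right) \log\left( \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 1 \right).
\]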


Remark: Notice that $\log|\mathcal{F}|$ is a lower bound on the query complexity of any algorithm that
endeavors to learn $c$ cuts in this graph that are decomposed into $m$ components, each of which is
$\kappa$-clustered, even if the algorithm knows the values $m$, $c$, and $\kappa$.

Now, this result tells us that if we assume, for the sake of simplicity, that $m$ evenly divides $c$, $\kappa$
is odd, and $c(\kappa + 1)$ evenly divides $2nm$ (notice that $c(\kappa + 2) \leq n$ by assumption), then we have
that

\begin{align*}
\log|\mathcal{F}| &\geq m \log\left( \frac{1}{m} \cdot \frac{2nm}{c(\kappa+1)} \times \frac{\kappa+1}{2} \right) + (c - m) \log\left( \frac{\kappa+1}{2} \right) \\
  &= m \log\frac{n}{c} + (c - m) \log\left( \frac{\kappa+1}{2} \right).
\end{align*}

Comparing this with Theorem 1 in the manuscript, we see that $S^2$ is indeed parameter optimal.

Proof. First, define

\[
r \triangleq \left\lfloor \frac{c}{m} \right\rfloor, \qquad
k \triangleq 2 \left\lfloor \frac{\kappa - 1}{2} \right\rfloor + 1, \qquad
p \triangleq \left\lfloor \frac{2n}{r(k-1) + 4} \right\rfloor.
\]


If $(n, c, m, \kappa) \in \mathcal{V}$, it can be shown that $r \geq 1$ and $p \geq m$. Let $G(r, k)$ denote the following graph
on $r\left(\frac{k-1}{2}\right) + 2$ vertices: two vertices are connected by $r$ edge-disjoint paths, and each path has
$\frac{k-1}{2}$ vertices (and $\frac{k+1}{2}$ edges). This is shown in Fig 4 for $r = 4$ and $k = 9$. $G$ is constructed by linking
$p$ copies of $G(r, k)$ and a “remainder” graph $G_{rem}$, which is a clique with $n - p\left(r\left(\frac{k-1}{2}\right) + 2\right)$
vertices, as shown in Fig 5. We will denote these $p$ copies of $G(r, k)$ as $G_1, \dots, G_p$.
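
The parameters of this construction can be computed directly. The sketch below is an illustration that assumes the definitions $r = \lfloor c/m \rfloor$, $k = 2\lfloor(\kappa-1)/2\rfloor + 1$ and $p = \lfloor 2n/(r(k-1)+4) \rfloor$, and that each copy of $G(r, k)$ has $r(k-1)/2 + 2$ vertices; the function names are hypothetical.

```python
def construction_parameters(n: int, c: int, m: int, kappa: int):
    """Return (r, k, p) for the lower-bound construction, under the stated assumptions."""
    r = c // m                         # cut-edges per chosen copy
    k = 2 * ((kappa - 1) // 2) + 1     # largest odd integer not exceeding kappa
    p = (2 * n) // (r * (k - 1) + 4)   # number of copies of G(r, k) that fit in n vertices
    return r, k, p


def copy_size(r: int, k: int) -> int:
    """Number of vertices in one copy of G(r, k): r paths plus the two end vertices."""
    return r * ((k - 1) // 2) + 2


if __name__ == "__main__":
    r, k, p = construction_parameters(n=1000, c=20, m=4, kappa=9)
    print(r, k, p, p * copy_size(r, k))  # the copies use at most n vertices in total
```

Whenever $m \leq c$ and $c(\kappa + 2) \leq n$, these values satisfy $r \geq 1$ and $p \geq m$, which is what the construction needs.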

Let $\mathcal{F}$ be the set of all labelings obtained as follows:

1. Choose $m$ out of the $p$ subgraphs $G_1, \dots, G_p$ without replacement. There are $\binom{p}{m}$ ways to
do this.

2. In each of these $m$ chosen subgraphs, pick $r$ edges to be cut edges, one on each of the $r$
paths. There are $\left(\frac{k+1}{2}\right)^r$ ways to do this in one subgraph. Therefore, there is a total number
of $\left(\frac{k+1}{2}\right)^{rm}$ ways to do this.

3. Now, let the left most vertex of $G_1$ be labeled $+1$; the rest of the labels are completely
determined by the cut-set we just chose.

Notice that for each $f \in \mathcal{F}$, the following holds: (a) there are exactly $m$ cut-components in the
graph, (b) the number of cuts is $m \times \left\lfloor \frac{c}{m} \right\rfloor \leq c$, and (c) in each cut component, the cuts are $k \leq \kappa$
close.

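The total number of such labelings is given by:

\[ |\mathcal{F}| = \binom{p}{m} \left( \frac{k+1}{2} \right)^{rm}. \]

Therefore, we can lower bound $\log|\mathcal{F}|$ as follows:

\begin{align*}
\log|\mathcal{F}| &\geq m \log\left( \frac{p}{m} \right) + rm \log\left( \frac{k+1}{2} \right) \\
  &= m \log\left( \frac{1}{m} \left\lfloor \frac{2n}{r(k-1) + 4} \right\rfloor \right) + m \left\lfloor \frac{c}{m} \right\rfloor \log\left( \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 1 \right) \\
  &= m \log\left( \frac{1}{m} \left\lfloor \frac{n}{\left\lfloor \frac{c}{m} \right\rfloor \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 2} \right\rfloor \right) + m \left\lfloor \frac{c}{m} \right\rfloor \log\left( \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 1 \right) \\
  &= m \log\left( \frac{1}{m} \left\lfloor \frac{n}{\left\lfloor \frac{c}{m} \right\rfloor \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 2} \right\rfloor \times \left( \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 1 \right) \right) + \left( m \left\lfloor \frac{c}{m} \right\rfloor - m \right) \log\left( \left\lfloor \frac{\kappa-1}{2} \right\rfloor + 1 \right).
\end{align*}

This concludes the proof. □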


D.2 Two Dimensional Grid 

In this section, we will show that in the case of the 2-dimensional grid $S^2$ is near optimal even if we
fix the graph beforehand. Notice that in this sense, this result is stronger than the one presented
in the previous section and is particularly relevant to the theory of nonparametric active learning.
Consider the example of a 2-dimensional $r \times r$ grid, where the bottom-left vertex is labeled $+1$
and the top-right vertex is labeled $-1$ (see Fig. 5). We want to count/lower bound the total number
of labelings such that there are exactly 2 connected components and such that the cut-size of the
labeling is no more than $r$. Notice that the logarithm of this number will be a lower bound on the
query complexity of any algorithm that endeavors to learn a cut set of size no more than $r$.

Theorem 4. The number of ways to partition an $r \times r$ grid into 2 components using a cut of size
at most $r$ such that the bottom-left vertex and top-right vertex are separated is lower bounded by
$2^r$.


Proof. We will first restrict our attention to cuts of size exactly $r$. Consider the grid shown in
Figure 6. We will begin by making a few simple observations. Each of the $(r-1) \times (r-1)$ boxes
that contains a cut has to contain at least 2 cuts. Furthermore, since these cuts are only allowed to
partition the grid into 2 connected components, cuts are located in contiguous boxes. Therefore,
there are at most $r - 1$ boxes that contain cuts.

We will think of a cut as a walk on the grid of $(r-1) \times (r-1)$ boxes (labeled as $(i, j)$, $1 \leq
i, j \leq r - 1$ in Fig 6) and lower bound the total number of such walks. Observe that for a walk to
correspond to a valid cut, it must contain one of the boxes labeled $S$ and one of the boxes labeled
$T$. By symmetry, it suffices to consider only walks that originate in an $S$ box and end in a $T$ box.

To lower bound the number of valid walks, we are going to restrict ourselves to positive walks
- walks that only move either right (R) or down (D). Notice that such walks traverse exactly $r - 1$
boxes. Towards this end, we first observe that there are $2(r-1)$ such walks each of which originates
in an $S$-block and terminates at the diametrically opposite $T$-block. These walks are made up
entirely of R moves or entirely of D moves. Therefore, to count the number of remaining walks, we
can restrict our attention to walks that contain at least one R move and one D move. Notice that
such walks cannot cross over the diagonal. Therefore, by symmetry, it suffices to consider walks
that start in an $S$-box on the left column, $\{(1,1), \dots, (1, r-1)\}$, and end in a $T$-box in the bottom
row, $\{(1,1), (2,1), \dots, (r-1, 1)\}$. Suppose, for $j \geq 2$, we start a walk at block $(1, j)$; then the
walk has to make exactly $j - 1$ down moves and $(r - 2 - j + 1)$ right moves (since the total
number of blocks in the walk is $r - 1$). Therefore, the total number of such positive walks that
originate from $(1, j)$ is $\binom{r-2}{j-1}$. Reasoning similarly, we conclude that there are $\sum_{j=2}^{r-2} \binom{r-2}{j-1}$
positive walks from one of the $S$-boxes in the left column to one of the $T$-boxes in the bottom
row. Finally, observe that the walk that starts at $(1,1)$ and ends at $(1, r-1)$ corresponds to two
different cuts since the $(1, r-1)$ box has two valid edges that can be cut. Similarly the walk
$(1, r-1) - \cdots - (r-1, r-1)$ corresponds to 2 valid cuts. Therefore, the total number of
cuts from such walks is given by $2(r-1) + 2\left(2 + \sum_{j=2}^{r-2} \binom{r-2}{j-1}\right) = 2(r-1) + 2^{r-1}$, where the
multiplication by 2 inside follows from the symmetry we used.

Observe now that if we allow the cuts to be smaller than $r$, then the total number of cuts is
given by summing over the cut-sizes as follows: $\sum_{i=2}^{r} \left( 2(i-1) + 2^{i-1} \right) = (r^2 - r) + 2^r - 2 \geq 2^r$. This
concludes the proof. □
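
Two counting steps in this proof can be checked by brute force. The sketch below is an illustration with hypothetical function names: it enumerates positive (right/down) walks over $r - 1$ boxes that start at box $(1, j)$ in the left column and end in the bottom row, compares the count with the binomial coefficient $\binom{r-2}{j-1}$, and verifies the closed form of the final sum.

```python
from itertools import product
from math import comb


def count_positive_walks(r: int, j: int) -> int:
    """Walks of r - 1 boxes using only R/D moves, from box (1, j) to the bottom row."""
    total = 0
    for moves in product("RD", repeat=r - 2):  # r - 1 boxes means r - 2 moves
        col, row = 1, j
        inside = True
        for mv in moves:
            if mv == "R":
                col += 1
            else:
                row -= 1
            if row < 1 or col > r - 1:  # the walk left the (r-1) x (r-1) box grid
                inside = False
                break
        if inside and row == 1:
            total += 1
    return total


def cuts_summed_over_sizes(r: int) -> int:
    """Sum over cut sizes i = 2..r of 2(i - 1) + 2^(i - 1)."""
    return sum(2 * (i - 1) + 2 ** (i - 1) for i in range(2, r + 1))


if __name__ == "__main__":
    r = 6
    print([count_positive_walks(r, j) for j in range(1, r)])
    print(cuts_summed_over_sizes(r), r * r - r + 2 ** r - 2, 2 ** r)
```

The enumeration agrees with the binomial count for each starting box, and the summed total is at least $2^r$ as claimed.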

Therefore, any algorithm will need to submit at least $\log(2^r) = \Omega(r)$ queries before it can
discover a cut of size at most $r$, and in fact, from the proof above, this seems like a pretty weak
lower bound (since we restricted ourselves only to positive walks). However, observe that since
$\kappa = 3$ and $|\partial C| \leq r$ here, Theorem 1 (from the manuscript) tells us that $S^2$ submits no more than
$O(r)$ queries for the same. Extending this argument to other families of graphs is an interesting
avenue for future work.